
[draft] [fix] [client] fix ack failed when consumer is reconnecting #21928

Open
wants to merge 3 commits into base: master

poorbarcode
Contributor

Motivation

The ack command fails while the consumer is reconnecting if acknowledgmentGroupTime is set to 0. See https://github.com/apache/pulsar/blob/master/pulsar-client/src/main/java/org/apache/pulsar/client/impl/PersistentAcknowledgmentsGroupingTracker.java#L361-L364

private CompletableFuture<Void> doImmediateAck(MessageIdAdv msgId, AckType ackType, Map<String, Long> properties,
                                               BitSetRecyclable bitSet) {
    ClientCnx cnx = consumer.getClientCnx();

    if (cnx == null) {
        return FutureUtil.failedFuture(new PulsarClientException
                .ConnectException("Consumer connect fail! consumer state:" + consumer.getState()));
    }
    return newImmediateAckAndFlush(consumer.consumerId, msgId, bitSet, ackType, properties, cnx);
}
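
The failure mode can be reproduced in isolation with a toy model. The sketch below is not Pulsar code: `ImmediateAckSketch` and `Connection` are hypothetical stand-ins, and `IllegalStateException` replaces `PulsarClientException.ConnectException`. It only illustrates that an immediate ack has nowhere to go while the connection reference is null.

```java
import java.util.concurrent.CompletableFuture;

/** Toy model of the failure described above; none of these types are Pulsar classes. */
final class ImmediateAckSketch {

    /** Hypothetical stand-in for ClientCnx; the reference is null while reconnecting. */
    static final class Connection {
        CompletableFuture<Void> send(long messageId) {
            // A real connection would write a CommandAck here.
            return CompletableFuture.completedFuture(null);
        }
    }

    /** Mirrors the shape of doImmediateAck: no connection means an immediately failed future. */
    static CompletableFuture<Void> immediateAck(Connection cnx, long messageId) {
        if (cnx == null) {
            CompletableFuture<Void> failed = new CompletableFuture<>();
            failed.completeExceptionally(new IllegalStateException("consumer is not connected"));
            return failed;
        }
        return cnx.send(messageId);
    }

    public static void main(String[] args) {
        // While reconnecting (null connection) the ack fails at once.
        System.out.println(immediateAck(null, 1L).isCompletedExceptionally());
        // With a live connection the same call succeeds.
        System.out.println(immediateAck(new Connection(), 1L).isCompletedExceptionally());
    }
}
```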

Modifications

Call ack after the consumer is connected.
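
One way to read "call ack after the consumer is connected" is to park acks that arrive during a reconnect and send them once the connection is back. The following is a minimal sketch under that assumption; `QueueingAckSketch`, `onConnectionLost`, and `onReconnected` are illustrative names and do not correspond to the actual change in this PR.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;

/** Toy tracker that queues acks while disconnected and flushes them on reconnect. */
final class QueueingAckSketch {

    /** An ack waiting for a connection, together with the future handed to the caller. */
    private static final class PendingAck {
        final long messageId;
        final CompletableFuture<Void> future = new CompletableFuture<>();

        PendingAck(long messageId) {
            this.messageId = messageId;
        }
    }

    private final Queue<PendingAck> pending = new ArrayDeque<>();
    /** Records what would have been written to the broker, in order. */
    final List<Long> sent = new ArrayList<>();
    private boolean connected;

    synchronized CompletableFuture<Void> ack(long messageId) {
        if (!connected) {
            // Instead of failing, remember the ack and complete its future later.
            PendingAck p = new PendingAck(messageId);
            pending.add(p);
            return p.future;
        }
        sent.add(messageId);
        return CompletableFuture.completedFuture(null);
    }

    synchronized void onConnectionLost() {
        connected = false;
    }

    synchronized void onReconnected() {
        connected = true;
        // Drain everything that arrived while the consumer was reconnecting.
        PendingAck p;
        while ((p = pending.poll()) != null) {
            sent.add(p.messageId);
            p.future.complete(null);
        }
    }
}
```

In this sketch the caller's future stays incomplete until `onReconnected` runs, so an ack issued during the reconnect window is delayed rather than rejected.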

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: x

@poorbarcode poorbarcode self-assigned this Jan 18, 2024
@poorbarcode poorbarcode added the type/bug The PR fixed a bug or issue reported a bug label Jan 18, 2024
@poorbarcode poorbarcode added this to the 3.3.0 milestone Jan 18, 2024
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Jan 18, 2024
@poorbarcode poorbarcode added the category/reliability The function does not work properly in certain specific environments or failures. e.g. data lost label Jan 18, 2024
// batch without ackSet.
{CommandAck.AckType.Individual, new BatchMessageIdImpl(1,1,1,0)},
{CommandAck.AckType.Cumulative, new BatchMessageIdImpl(1,1,1,0)},
// batch with ackSe.
Contributor

Suggested change
// batch with ackSe.
// batch with ackSet.

Contributor Author

Fixed

public void testImmediateAckWhenReconnecting(CommandAck.AckType ackType, MessageId messageId) throws Exception {
final String topic = BrokerTestUtil.newUniqueName("persistent://my-property/my-ns/tp_");
final String subscriptionName = "s1";
PulsarClient delayConnectClient = createDelayReconnectClient();
Contributor

Should the client be closed?

Contributor Author

Fixed

@@ -871,6 +871,7 @@ public CompletableFuture<Void> connectionOpened(final ClientCnx cnx) {
if (!(firstTimeConnect && hasParentConsumer) && getCurrentReceiverQueueSize() != 0) {
increaseAvailablePermits(cnx, getCurrentReceiverQueueSize());
}
acknowledgmentsGroupingTracker.afterConsumerReconnected();
Contributor

Why not call the flush() method directly?

Contributor

+1

Contributor Author

Fixed

Comment on lines +257 to +258
if (!queueDueToConnecting
&& (acknowledgementGroupTimeMicros == 0 || (properties != null && !properties.isEmpty()))) {
Contributor

Will it break the behavior of

// We cannot group acks if the delay is 0 or when there are properties attached to it. Fortunately that's an
// uncommon condition since it's only used for the compaction subscription.

If the consumer is reconnecting and the ack has properties, the acks will also be grouped, which is not expected?

Contributor

@BewareMyPower BewareMyPower Jan 19, 2024

@codelipenghui Yes. But this behavior was only added for the TwoPhaseCompactor to avoid latency when seeking; the client side has no way to attach properties. See:

.thenCompose((v) -> reader.acknowledgeCumulativeAsync(lastReadId,
Map.of(COMPACTED_TOPIC_LEDGER_PROPERTY, ledger.getId())))

Currently, if acknowledge is called during reconnection, phaseTwoSeekThenLoop will fail.

From my perspective, we should also queue these ACK requests and flush them once the consumer is connected. @poorbarcode

Contributor Author

@poorbarcode poorbarcode Jan 21, 2024

Agree with @BewareMyPower

I am trying to split this PR into the following parts:

  • part-1: add a new component AcknowledgmentCache to support caching acknowledgments that carry the properties argument.
  • part-2: split PersistentAcknowledgmentsGroupingTracker into two implementations:
    • cache and batch the acks
    • immediately ack
  • part-3: fix the issue
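
The split proposed in part-2 could take roughly the following shape. This is a hypothetical sketch, not the planned Pulsar design: `AckTracker`, `ImmediateTracker`, `BatchingTracker`, and `create` are invented names, message IDs are reduced to `long`, and the `sender` callback stands in for writing to the broker connection.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Toy illustration of one tracker interface with an immediate and a batching implementation. */
final class SplitTrackerSketch {

    interface AckTracker {
        void ack(long messageId);

        void flush();
    }

    /** Sends every ack on its own as soon as it arrives. */
    static final class ImmediateTracker implements AckTracker {
        private final Consumer<List<Long>> sender;

        ImmediateTracker(Consumer<List<Long>> sender) {
            this.sender = sender;
        }

        @Override
        public void ack(long messageId) {
            sender.accept(List.of(messageId));
        }

        @Override
        public void flush() {
            // Nothing is ever buffered, so there is nothing to flush.
        }
    }

    /** Buffers acks and sends them in groups. */
    static final class BatchingTracker implements AckTracker {
        private final Consumer<List<Long>> sender;
        private final int maxGroupSize;
        private final List<Long> pending = new ArrayList<>();

        BatchingTracker(Consumer<List<Long>> sender, int maxGroupSize) {
            this.sender = sender;
            this.maxGroupSize = maxGroupSize;
        }

        @Override
        public void ack(long messageId) {
            pending.add(messageId);
            if (pending.size() >= maxGroupSize) {
                flush();
            }
        }

        @Override
        public void flush() {
            if (pending.isEmpty()) {
                return;
            }
            sender.accept(List.copyOf(pending));
            pending.clear();
        }
    }

    /** Picks the implementation once, instead of branching on the group time in every ack call. */
    static AckTracker create(long groupTimeMicros, int maxGroupSize, Consumer<List<Long>> sender) {
        return groupTimeMicros == 0
                ? new ImmediateTracker(sender)
                : new BatchingTracker(sender, maxGroupSize);
    }
}
```

The sketch leaves out timers, locking, and the properties-bearing acks that part-1 is meant to cache; it only shows how the per-call condition moves into the choice of implementation.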

Contributor Author

Marked the current PR as a draft.

Contributor

@BewareMyPower BewareMyPower Jan 22, 2024

split PersistentAcknowledgmentsGroupingTracker into two implementations

That's reasonable. BTW, the C++ client is already implemented this way:

https://github.com/apache/pulsar-client-cpp/blob/72b7311aeef32e28a28e926da686aaf948e8f948/lib/ConsumerImpl.cc#L201C1-L215C6

The naming is not good, though it is consistent with the other impl classes in the library.

private CompletableFuture<Void> doIndividualAck(MessageIdAdv messageId, Map<String, Long> properties) {
if (acknowledgementGroupTimeMicros == 0 || (properties != null && !properties.isEmpty())) {
private CompletableFuture<Void> doIndividualAck(MessageIdAdv messageId, Map<String, Long> properties,
boolean queueDueToConnecting) {
Contributor

Method invocations like f(args..., true) and f(args..., false) are hard to read unless you jump to the implementation of f and see what the boolean parameter means.

You should add a new method like queueIndividualAck and call it directly. For example,

    private CompletableFuture<Void> queueIndividualAck(MessageIdAdv messageId) {
        Optional<Lock> readLock = acquireReadLock();
        try {
            doIndividualAckAsync(messageId);
            return readLock.map(__ -> currentIndividualAckFuture).orElse(CompletableFuture.completedFuture(null));
        } finally {
            readLock.ifPresent(Lock::unlock);
            if (pendingIndividualAcks.size() >= maxAckGroupSize) {
                flush();
            }
        }
    }

    private CompletableFuture<Void> doIndividualAck(MessageIdAdv messageId, Map<String, Long> properties) {
        if (acknowledgementGroupTimeMicros == 0 || (properties != null && !properties.isEmpty())) {
            // We cannot group acks if the delay is 0 or when there are properties attached to it. Fortunately that's an
            // uncommon condition since it's only used for the compaction subscription.
            return doImmediateAck(messageId, AckType.Individual, properties, null);
        } else {
            return queueIndividualAck(messageId);
        }
    }

Then you don't have to add a false argument to all existing doIndividualAck calls. And in doImmediateAck, you can call queueIndividualAck directly.

    private CompletableFuture<Void> doImmediateAck(MessageIdAdv msgId, AckType ackType, Map<String, Long> properties,
                                                   BitSetRecyclable bitSet) {
        ClientCnx cnx = consumer.getClientCnx();

        if (cnx == null && consumer.getState() == HandlerState.State.Connecting) {
            if (ackType == AckType.Cumulative) {
                return queueCumulativeAck(msgId);
            } else {
                return queueIndividualAck(msgId);
            }
        }

Contributor Author

Sure, I will change the implementation of this PR as suggested in #21928 (comment)

@poorbarcode poorbarcode changed the title [fix] [client] fix ack failed when consumer is reconnecting [draft] [fix] [client] fix ack failed when consumer is reconnecting Jan 21, 2024
@coderzc coderzc modified the milestones: 3.3.0, 3.4.0 May 8, 2024
@lhotari lhotari modified the milestones: 4.0.0, 4.1.0 Oct 11, 2024
@lhotari
Member

lhotari commented Oct 14, 2024

Please rebase

@lhotari lhotari added the triage/lhotari/important lhotari's triaging label for important issues or PRs label Oct 14, 2024
@lhotari
Member

lhotari commented Nov 22, 2024

@poorbarcode Please rebase

Labels
  • category/reliability: The function does not work properly in certain specific environments or failures. e.g. data lost
  • doc-not-needed: Your PR changes do not impact docs
  • release/2.10.6
  • release/2.11.4
  • release/3.0.12
  • release/3.1.3
  • release/3.2.1
  • triage/lhotari/important: lhotari's triaging label for important issues or PRs
  • type/bug: The PR fixed a bug or issue reported a bug
5 participants